petercad commented Oct 4, 2025

This PR updates FlashAttention to the new copy/MMA atoms.

Changes:

  • Prefill and decode unified into a single implementation, allowing simultaneous K and Q subgroup-level parallelization rather than an either-or.
  • GEMMs and softmax grouped together and the full k loop consolidated into an FMHA mainloop class.
    • This will facilitate further manual pipelining/overlap of GEMM with softmax.
  • Use new copy/MMA atoms and reorders to transparently support arbitrary data types.
  • Automatic copy/MMA operator selection.
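
As a rough illustration of the mainloop structure described in the list above (not the kernel code from this PR), the following scalar Python sketch runs one k loop over K/V tiles, interleaving the first GEMM (S = q K^T), an online softmax update, and the second GEMM (O += P V). The names `mainloop_partial` and `merge_partials`, the tile size, and the way partial results from separate K ranges are combined are all assumptions made for illustration; the real implementation uses the Xe copy/MMA atoms and may organize the reduction differently.

```python
import math


def naive_attention(q, k, v, scale):
    """Reference: softmax(scale * q K^T) V for a single query row."""
    s = [scale * sum(qi * ki for qi, ki in zip(q, row)) for row in k]
    m = max(s)
    p = [math.exp(x - m) for x in s]
    denom = sum(p)
    return [sum(p[j] * v[j][c] for j in range(len(v))) / denom
            for c in range(len(v[0]))]


def mainloop_partial(q, k, v, scale, tile=2):
    """Hypothetical mainloop: one k loop doing GEMM, online softmax, GEMM.

    Returns the unnormalized output accumulator, the running row max,
    and the running softmax denominator.
    """
    acc = [0.0] * len(v[0])
    m = -math.inf
    l = 0.0
    for start in range(0, len(k), tile):
        k_tile = k[start:start + tile]
        v_tile = v[start:start + tile]
        # GEMM 1: scores for this K tile.
        s = [scale * sum(qi * ki for qi, ki in zip(q, row)) for row in k_tile]
        # Online softmax: rescale earlier contributions to the new max.
        m_new = max(m, max(s))
        corr = math.exp(m - m_new)  # 0.0 on the first tile, since m is -inf
        p = [math.exp(x - m_new) for x in s]
        l = l * corr + sum(p)
        # GEMM 2: accumulate P V for this tile.
        acc = [a * corr + sum(p[j] * v_tile[j][c] for j in range(len(p)))
               for c, a in enumerate(acc)]
        m = m_new
    return acc, m, l


def merge_partials(parts):
    """Combine (acc, m, l) partials computed over disjoint K ranges."""
    m = max(part[1] for part in parts)
    acc = [0.0] * len(parts[0][0])
    l = 0.0
    for a, mi, li in parts:
        w = math.exp(mi - m)  # bring each partial onto the common max
        l += li * w
        acc = [x + y * w for x, y in zip(acc, a)]
    return [x / l for x in acc]
```

Each query row is processed independently, so a split over Q needs no communication, while a split over K needs a merge step such as the assumed `merge_partials`; this is one way the two kinds of parallelism can be used at the same time.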

Current status: the prefill and decode examples are working, with performance similar to or better than the old examples.

Known issues:

  • Head size 192 decode config doesn't compile yet. Edit: fixed.
  • Strange SYCL compiler behavior/bug with tSrS->tArP reorder. Apparently the compiler believes there is UB somewhere and will omit a large section of the kernel as a result. For the moment, there's a direct copy as a workaround while I pin down the issue. I'm not able to reproduce this behavior with the reorder in isolation.

Additional features (causal masking, variable sequence lengths, etc.) to be added later.

Reminder: the new atoms require a very recent driver due to necessary IGC fixes/enhancements. Recommended version: ci-comp_igc-30613.

petercad changed the title from "[Umbrella commit] Re-implement FlashAttention with new Xe atoms" to "Re-implement FlashAttention with new Xe atoms" on Oct 4, 2025

petercad commented Oct 4, 2025

I will break up this large commit into self-contained smaller commits after review is complete.


ClarkChin08 commented Oct 23, 2025

The following command fails the accuracy check (Disposition: Failed) with seq_len_kv=256:

./examples/06_bmg_flash_attention/06_xe_fmha_fwd_decode_hdim128 --iterations=10 --batch=1 --num_heads_q=8 --seq_len_kv=256 --seq_len_qo=1 --num_heads_kv=8

Reported mismatch:

output [991]: 2.696791 vs -nan

However, when seq_len_kv is changed to 512 or higher, the example passes.


petercad commented Oct 23, 2025

> The following command encounters an accuracy issue (Disposition: Failed) with seq_len_kv=256

@ClarkChin08 I pushed a patch to fix issues like this earlier today. I double-checked your test case, and it's passing on my system; can you double-check with the latest commit?

petercad force-pushed the petercad/rearch_sdpa branch from af2f402 to 326669e on October 23, 2025 03:54
@ClarkChin08

> @ClarkChin08 I pushed a patch to fix issues like this earlier today. I double-checked your test case, and it's passing on my system; can you double-check with the latest commit?

Yes, passed now.

@petercad

Note: the CI is currently failing with compile-time divide-by-zero errors, but I can't reproduce the errors locally with any compiler/compile flags. If anyone can, let me know.

petercad force-pushed the petercad/rearch_sdpa branch from f767eb5 to 10b0c97 on October 27, 2025 21:56
@petercad

> Note: the CI is currently failing with compile-time divide-by-zero errors, but I can't reproduce the errors locally with any compiler/compile flags. If anyone can, let me know.

Didn't realize CI was merging branches into main prior to testing. Thanks to @rolandschulz for helping figure this out.

Branch is rebased now and split into a logical set of patches.

petercad force-pushed the petercad/rearch_sdpa branch 2 times, most recently from b0e30f4 to 7dd479b, on October 27, 2025 23:19
tdeng5 added the release label on Oct 28, 2025
petercad force-pushed the petercad/rearch_sdpa branch from 7dd479b to 460d34a on October 28, 2025 15:37
petercad force-pushed the petercad/rearch_sdpa branch from 2bb6829 to 9f74e54 on October 31, 2025 17:33
petercad enabled auto-merge on October 31, 2025 17:34
petercad disabled auto-merge on October 31, 2025 17:34
petercad enabled auto-merge on October 31, 2025 17:35
petercad merged commit 7ab29af into main on Oct 31, 2025
5 of 6 checks passed
tdeng5 deleted the petercad/rearch_sdpa branch on November 1, 2025 04:22
rolandschulz mentioned this pull request on Nov 3, 2025
Labels: enhancement, release, urgent
